A Subsequence-Histogram Method for Generic 
Vocabulary Recognition over Deletion Channels 



Majid Fozunbal 
Hewlett-Packard Laboratories 
Palo Alto, CA 94304 
majid.fozunbal@hp.com 






Abstract: We consider the problem of recognizing a vocabulary
(a collection of words, i.e., sequences over a finite alphabet)
from a potential subsequence of one of its words. We assume the
given subsequence is received through a deletion channel as a
result of transmitting a random word from one of two generic
underlying vocabularies. An exact maximum a posteriori (MAP)
solution for this problem counts the number of ways a given
subsequence can be derived from particular subsets of candidate
vocabularies, requiring exponential time or space.

We present a polynomial approximation algorithm for this 
problem. The algorithm makes no prior assumption about the 
rules and patterns governing the structure of vocabularies. 
Instead, through off-line processing of vocabularies, it extracts 
data regarding regularity patterns in the subsequences of each 
vocabulary. In the recognition phase, the algorithm just uses 
this data, called subsequence-histogram, to decide in favor of one 
of the vocabularies. We provide examples to demonstrate the 
performance of the algorithm and show that it can achieve the 
same performance as MAP in some situations. 

Potential applications include bioinformatics, storage systems, 
and search engines. 

Index terms: Classification, histogram, recognition, search,
storage, subsequence.

I. Introduction 

Suppose you have two databases of sequences. You observe a
subsequence that you know is derived from one of the databases,
and you would like to determine which one. Alternatively,
suppose you have two nonlinear codebooks whose generating rules
you do not know. A codeword is chosen from one of the codebooks,
punctured with an unknown deletion pattern, and handed to you.
Your task is to infer which codebook was chosen. As another
example, suppose you have the vocabularies of two languages,
Spanish and Italian. You see an abbreviated version of a word in
one of them and try to decide which language it belongs to.

We refer to these applications as generic vocabulary
recognition, in which one seeks to identify an underlying
vocabulary (collection) based on a potential subsequence of one
of its words. Two things make these problems particularly
challenging: 1) the vocabularies are generic and 2) the deletion
patterns are unknown. Since the vocabularies are generic, we do
not know a common underlying model or structure that would
simplify their description. For example, if we knew that the
words in each vocabulary were generated by an i.i.d. source, then
a simple regular histogram method could be used to learn the
underlying distribution for each vocabulary and contrast it
against that of the received subsequence. Moreover, the existence
of an unknown deletion pattern makes the problem considerably
more complex. If there were no deletion channel, a suffix-tree
implementation of the vocabularies would suffice to deal with the
problem [1]. In the presence of deletions, however, one needs
exponential space to build a generalized suffix tree that
encloses all subsequences derived from a vocabulary.

Deletion channels, not to be mistaken for erasure channels,
arise in a variety of applications. In a recent survey,
Mitzenmacher [2] points out some of these applications as well as
existing open problems, including the capacity of the binary
deletion channel. In [3], Diggavi et al. study communication over
a finite buffer in the framework of deletion channels, and in
[4], they provide upper bounds on the capacity of these channels.
As noted by these authors, the main challenge stems from the
unknown deletion patterns. As a result, a maximum a posteriori
(MAP) or maximum likelihood (ML) detector must find the
transmitted codeword by analyzing its subsequences.

Problems involving subsequences are generally hard
combinatorial problems whose exact solutions require exponential
time or space [5], [6]. Decoding over deletion channels [3] is no
exception, and neither is our case: we formulate the vocabulary
recognition problem in a MAP decision framework, which leads to
the same bottleneck as in decoding, namely counting the number of
ways that a received subsequence can be obtained from a
particular sequence. Unlike the case of decoding, here we also
need to compute the aggregated number of ways that a received
subsequence can be derived from certain groups of words.

Since this exact computation is prohibitively expensive for
large vocabularies, we approximate it by the average number of
position-wise matches that a received subsequence can find in the
multiset of same-length subsequences derived from a vocabulary.
This approximation leads to an algorithm that we call the
subsequence-histogram method, owing to its operational similarity
to the regular histogram method. In fact, one may view the
regular histogram method as a special case of the
subsequence-histogram method for subsequences of unit length. The
algorithm requires space proportional to the cube of the maximum
word length and time upper bounded by the product of the length
of the received subsequence and the maximum word length. Examples
are provided to illustrate the operation and performance of the
algorithm. In one simple example, we show that the algorithm can
achieve the same performance as exact MAP. Another example,
simulated numerically, shows a case in which the algorithm
achieves an error rate of about 10%, compared to 50% for the
regular histogram method. A rigorous performance analysis of the
algorithm and an assessment of its proximity to exact MAP are
still work in progress.

The paper is organized as follows. In Section II, we discuss
the setup of the problem. In Section III, we introduce the
algorithm and provide examples to demonstrate its operation and
performance. Finally, in Section IV, we discuss the derivation of
the algorithm and its error rate analysis.

II. Setup 

Let $n$ denote a discrete time index, and let $\Sigma$ be a
finite alphabet. Assume two finite vocabularies
$V_{\theta_1}, V_{\theta_2} \subset \bigcup_{n=1}^{L} \Sigma^n$ of
maximum order $L$ are given. A vocabulary is picked randomly with
a known probability $P(\theta)$, and a word
$W = (w_1, \ldots, w_n)$ is picked from it uniformly at random.
The word $W$ is passed through a deletion channel, producing
$S = (s_1, \ldots, s_m)$ at the output. The channel is i.i.d.:
every letter $w_i$ of the word $W$ is deleted independently with
probability $p$. The problem is to observe $S$ and infer which
underlying vocabulary was more likely picked.
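
To make the setup concrete, the following is a minimal sketch of
the generative model, assuming words are Python strings and each
vocabulary is a list of strings. The function names
(`deletion_channel`, `sample_observation`, `is_subsequence`) are
illustrative choices and do not appear in the paper.

```python
import random


def deletion_channel(word, p, rng):
    """Delete each letter of `word` independently with probability p."""
    return "".join(ch for ch in word if rng.random() >= p)


def sample_observation(vocabularies, priors, p, rng):
    """Pick a vocabulary by its prior, a word uniformly, then apply deletions."""
    theta = rng.choices(range(len(vocabularies)), weights=priors, k=1)[0]
    word = rng.choice(vocabularies[theta])
    return theta, word, deletion_channel(word, p, rng)


def is_subsequence(s, w):
    """True if s can be obtained from w by deleting letters."""
    remaining = iter(w)
    # `ch in remaining` consumes the iterator, so letter order is enforced.
    return all(ch in remaining for ch in s)
```

The recognizer only ever sees the last element of the returned
triple; the vocabulary index and the word are kept here so that a
simulation can score its decisions afterwards.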

Example 2.1: Let $\Sigma = \{0, 1\}$ and assume

$$V_{\theta_1} = \{0101, 1100\}$$

and

$$V_{\theta_2} = \{1010, 0011\}$$

are two equiprobable underlying vocabularies. If $S = 01$ is
observed, $V_{\theta_2}$ is chosen since the likelihoods of
$V_{\theta_1}$ and $V_{\theta_2}$ are $\frac{3}{2}p^2(1-p)^2$ and
$\frac{5}{2}p^2(1-p)^2$, respectively. On the other hand,
observing $S = 10$, $V_{\theta_1}$ is chosen. Intuitively, this is
because $S = 10$ and $S = 01$ are common subsequences in
$V_{\theta_1}$ and $V_{\theta_2}$, respectively.
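
The likelihoods in this example can be checked by brute force.
The sketch below, with hypothetical helper names, enumerates
every way of choosing $|S|$ positions in a word and counts those
that spell $S$; it is exponential in general and only meant for
toy vocabularies like these. For $S = 01$ it counts 3 embeddings
in the first vocabulary and 5 in the second.

```python
from itertools import combinations


def count_embeddings(s, w):
    """Number of position sets of w that spell s (brute-force enumeration)."""
    return sum(
        1
        for idx in combinations(range(len(w)), len(s))
        if all(w[j] == c for j, c in zip(idx, s))
    )


def likelihood(s, vocabulary, p):
    """P(S = s | vocabulary) for a uniformly chosen word and i.i.d. deletions."""
    m = len(s)
    total = 0.0
    for w in vocabulary:
        n = len(w)
        if m <= n:
            total += count_embeddings(s, w) * p ** (n - m) * (1 - p) ** m
    return total / len(vocabulary)


V1 = ["0101", "1100"]
V2 = ["1010", "0011"]
```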

The MAP formulation of this problem is to maximize

$$P(\theta \mid S) \propto P(\theta) \sum_{W \in V_\theta} P(W \mid \theta)\, P(S \mid W), \tag{1}$$

in which

$$P(W \mid \theta) = \frac{1}{|V_\theta|}$$

for every $W \in V_\theta$. The challenge in Eq. (1) is to
compute or approximate $P(S \mid W)$. If there is no deletion,
i.e., $p = 0$, then the observed sequence $S$ must exactly match
a word $W$ in one of the dictionaries. In this case, a
suffix-tree implementation of the vocabularies would require
$O(|V_\theta| L)$ space and would identify an exact match of the
sequence $S$ in $O(m)$ time. For $p \neq 0$, a generalized suffix
tree is not an option, as it needs to enclose the multiset of all
possible subsequences of $V_\theta$, requiring
$O(|V_\theta| 2^L)$ space in the worst case.

III. Algorithm 

Here, we introduce an approximation algorithm that needs
$O(|\Sigma| L^3)$ space and operates in $O(mL)$ time. The
algorithm has two phases: a learning phase and a recognition
phase.

TABLE I
Illustration of the learning phase of the algorithm.

I. Computing $\Phi_n$
   1. Initialization: For $n \le L$, $\sigma \in \Sigma$, and $j \le n$,
        $\Phi_n(\sigma, j) = 0$.
   2. Recursion: For $W \in V_\theta$, let $n = |W|$. For any $j \le n$,
        $\Phi_n(w_j, j) = \Phi_n(w_j, j) + 1$.
   3. Termination: For $n \le L$,
        output the $|\Sigma| \times n$ matrix $\Phi_n$.

II. Computing $\Psi_{n,m}$
   1. Recursion: For $n \le L$, $m \le n$, $i \le m$, and $\sigma \in \Sigma$,
        $\Psi_{n,m}(\sigma, i) = \sum_{j=1}^{n} \Phi_n(\sigma, j) \binom{j-1}{i-1} \binom{n-j}{m-i}$.
   2. Termination: For $n \le L$ and $m \le n$,
        output the $|\Sigma| \times m$ matrix $\Psi_{n,m}$.

Learning Phase: In the learning phase, the algorithm computes
two sets of matrices. The first set consists of what we call
positional histograms. That is, for every $n \le L$, we compute
$|\Sigma| \times n$ matrices $\Phi_n$ whose elements are

$$\Phi_n(\sigma, j) = \sum_{W \in V_\theta(n)} I(w_j = \sigma) \tag{2}$$

for $\sigma \in \Sigma$ and $1 \le j \le n$. Here, $I(\cdot)$ is
the indicator function and

$$V_\theta(n) = \{W \in V_\theta : |W| = n\}$$

denotes the subset of $V_\theta$ containing all words of length
$n$. Eq. (2) denotes the number of $n$-length words whose $j$-th
position (from left) is $\sigma$. Through an off-line process,
one can adaptively compute $\Phi_n(\sigma, j)$ for every $\sigma$
and $j$ and store it in a $|\Sigma| \times n$ matrix $\Phi_n$.
The total space required for this purpose is $O(|\Sigma| L^2)$.
This process is summarized in Table I.
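Example 3.1 (Positional histograms): Consider the two given
vocabularies in Example 2.1. Then, for $n = 4$,

$$\Phi_4 = \begin{bmatrix} 1 & 0 & 2 & 1 \\ 1 & 2 & 0 & 1 \end{bmatrix} \text{ for } V_{\theta_1}, \qquad
\Phi_4 = \begin{bmatrix} 1 & 2 & 0 & 1 \\ 1 & 0 & 2 & 1 \end{bmatrix} \text{ for } V_{\theta_2},$$

where the first and second rows correspond to $\sigma = 0$ and
$\sigma = 1$, respectively. For $n = 1, 2, 3$, the positional
histograms are zero.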
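
The first part of the learning phase can be sketched in a few
lines. This is a minimal illustration rather than the author's
implementation; the function name and the dictionary-of-matrices
layout are assumptions. Rows follow the order of `alphabet`, and
positions are 0-based in the code while the text counts from 1.

```python
def positional_histograms(vocabulary, alphabet):
    """Return {n: Phi_n}, where Phi_n[a][j] counts the length-n words
    whose letter at 0-based position j is alphabet[a]."""
    row = {sigma: a for a, sigma in enumerate(alphabet)}
    phi = {}
    for w in vocabulary:
        n = len(w)
        if n not in phi:
            # Initialization step: an all-zero |alphabet| x n matrix.
            phi[n] = [[0] * n for _ in alphabet]
        for j, letter in enumerate(w):
            # Recursion step: one increment per letter of each word.
            phi[n][row[letter]][j] += 1
    return phi
```

Lengths for which the vocabulary has no words are simply absent
from the returned dictionary, which corresponds to the all-zero
histograms mentioned for $n = 1, 2, 3$.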

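The second set consists of what we call subsequence
histograms. For every $n \le L$ and $m \le n$, we compute
$|\Sigma| \times m$ matrices $\Psi_{n,m}$ whose elements are

$$\Psi_{n,m}(\sigma, i) = \sum_{j=1}^{n} \Phi_n(\sigma, j)\, c_{n,m}(j, i), \tag{3}$$

where

$$c_{n,m}(j, i) = \binom{j-1}{i-1} \binom{n-j}{m-i}. \tag{4}$$

For $\sigma \in \Sigma$ and $1 \le i \le m$,
$\Psi_{n,m}(\sigma, i)$ denotes the number of ways that an
$m$-length subsequence can be derived from $V_\theta(n)$ such
that its $i$-th element is $\sigma$. Each matrix $\Psi_{n,m}$ may
be viewed as the positional histogram of the multiset of all
$m$-subsequences of $V_\theta(n)$. These matrices can be computed
off-line and stored in $O(|\Sigma| L^3)$ space. Table I
illustrates this process.

TABLE II
The recognition phase.

Computing Similarity Score
   1. Similarity score: Given $S = (s_1, \ldots, s_m)$, compute
        $\frac{1}{m} \sum_{i=1}^{m} \Psi_{n,m}(s_i, i)$
      for every vocabulary $V_\theta$ and for $m \le n \le L$.
   2. Total similarity score:
        $J_\theta(S) = \frac{P(\theta)}{|V_\theta|} \sum_{n=m}^{L} p^{n-m} (1-p)^m \, \frac{1}{m} \sum_{i=1}^{m} \Psi_{n,m}(s_i, i)$.
   3. Recognition: $V_\theta$ with maximum $J_\theta(S)$.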

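Example 3.2 (Subsequence histograms): For the vocabulary
$V_{\theta_1}$ in Example 2.1, the subsequence histograms are

$$\Psi_{4,1} = \begin{bmatrix} 4 \\ 4 \end{bmatrix}, \quad
\Psi_{4,2} = \begin{bmatrix} 5 & 7 \\ 7 & 5 \end{bmatrix}, \quad
\Psi_{4,3} = \begin{bmatrix} 3 & 4 & 5 \\ 5 & 4 & 3 \end{bmatrix}, \quad
\Psi_{4,4} = \begin{bmatrix} 1 & 0 & 2 & 1 \\ 1 & 2 & 0 & 1 \end{bmatrix}.$$

For vocabulary $V_{\theta_2}$, on the other hand, they are

$$\Psi_{4,1} = \begin{bmatrix} 4 \\ 4 \end{bmatrix}, \quad
\Psi_{4,2} = \begin{bmatrix} 7 & 5 \\ 5 & 7 \end{bmatrix}, \quad
\Psi_{4,3} = \begin{bmatrix} 5 & 4 & 3 \\ 3 & 4 & 5 \end{bmatrix}, \quad
\Psi_{4,4} = \begin{bmatrix} 1 & 2 & 0 & 1 \\ 1 & 0 & 2 & 1 \end{bmatrix}.$$

Note that $\Psi_{4,1}$ is exactly the same data as the regular
histogram. Thus, we may view the regular histogram as a special
case of the subsequence histogram for subsequences of unit
length. Moreover, since $\Psi_{4,1}$ is the same for both
vocabularies, the regular histogram method would be no better
than tossing a fair coin. In contrast, $\Psi_{4,2}$,
$\Psi_{4,3}$, and $\Psi_{4,4}$ reflect the differences between
$V_{\theta_1}$ and $V_{\theta_2}$, which is what enables the
proposed algorithm to extract some regularity patterns of a
vocabulary.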
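
The second part of the learning phase can be sketched as follows,
assuming each subsequence-histogram entry is the positional
histogram weighted by the product of binomial coefficients
$\binom{j-1}{i-1}\binom{n-j}{m-i}$. The function names are
hypothetical. For the vocabulary $\{0101, 1100\}$ and $m = 2$
this sketch yields the rows (5, 7) and (7, 5).

```python
from math import comb


def positional_histogram(words, alphabet, n):
    """Phi_n as a |alphabet| x n matrix built from the words of length n."""
    row = {sigma: a for a, sigma in enumerate(alphabet)}
    phi = [[0] * n for _ in alphabet]
    for w in words:
        if len(w) == n:
            for j, letter in enumerate(w):
                phi[row[letter]][j] += 1
    return phi


def subsequence_histogram(phi_n, m):
    """Psi_{n,m} as a |alphabet| x m matrix derived from Phi_n."""
    n = len(phi_n[0])
    return [
        [
            # j and i are 1-based here to mirror the formula in the text.
            sum(
                phi_row[j - 1] * comb(j - 1, i - 1) * comb(n - j, m - i)
                for j in range(1, n + 1)
            )
            for i in range(1, m + 1)
        ]
        for phi_row in phi_n
    ]
```

Storing one such matrix for every pair $(n, m)$ with
$m \le n \le L$ is what gives the cubic dependence on $L$ in the
space requirement.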



Recognition Phase: Observing $S = (s_1, \ldots, s_m)$, we
compute the subsequence similarity score of order $n$,

$$\frac{1}{m} \sum_{i=1}^{m} \Psi_{n,m}(s_i, i). \tag{5}$$

This score is an upper bound that is used as an approximation
for the number of ways that $S$ can be derived from
$V_\theta(n)$. Computing this score for all
$n \in \{m, \ldots, L\}$, we obtain the total similarity score

$$J_\theta(S) = \frac{P(\theta)}{|V_\theta|} \sum_{n=m}^{L} p^{n-m} (1-p)^m \, \frac{1}{m} \sum_{i=1}^{m} \Psi_{n,m}(s_i, i) \tag{6}$$

and choose the vocabulary with maximum score.
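
The recognition phase can be sketched as below. Two assumptions
are made: the total score weights each order-$n$ score by
$p^{n-m}(1-p)^m$ and by the prior divided by the vocabulary size,
and, for brevity, the histograms are recomputed from the word
lists on every call rather than read from tables stored off-line.
All function names are illustrative.

```python
from math import comb, isclose


def similarity_score(s, words_n, n):
    """Order-n score: (1/m) * sum over i of Psi_{n,m}(s_i, i)."""
    m = len(s)
    total = 0
    for i, letter in enumerate(s, start=1):
        for j in range(1, n + 1):
            # Phi_n(letter, j): words whose j-th letter equals `letter`.
            phi = sum(1 for w in words_n if w[j - 1] == letter)
            total += phi * comb(j - 1, i - 1) * comb(n - j, m - i)
    return total / m


def total_score(s, vocabulary, prior, p):
    """Total similarity score of s against one vocabulary."""
    m = len(s)
    if m == 0:
        raise ValueError("the observed sequence must be non-empty")
    score = 0.0
    for n in sorted({len(w) for w in vocabulary if len(w) >= m}):
        words_n = [w for w in vocabulary if len(w) == n]
        score += p ** (n - m) * (1 - p) ** m * similarity_score(s, words_n, n)
    return prior * score / len(vocabulary)


def recognize(s, vocabularies, priors, p):
    """Index of the best-scoring vocabulary, or None on a draw."""
    scores = [total_score(s, v, q, p) for v, q in zip(vocabularies, priors)]
    best = max(range(len(scores)), key=scores.__getitem__)
    ties = [k for k, x in enumerate(scores) if isclose(x, scores[best])]
    return best if len(ties) == 1 else None
```

With the two vocabularies $\{0101, 1100\}$ and $\{1010, 0011\}$,
equal priors, and any $0 < p < 1$, `recognize` returns the second
vocabulary for `"01"`, the first for `"10"`, and a draw for
`"00"`.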



Example 3.3 (Achieving MAP Performance): Consider the same
setup as in Example 2.1. For an observed sequence $S$, the
algorithm computes the total similarity score $J_\theta(S)$ and
makes the following decisions.

• $S \in \{0, 1\} \Rightarrow$ draw.

• $S \in \{00, 11\} \Rightarrow$ draw; $S = 01 \Rightarrow V_{\theta_2}$; $S = 10 \Rightarrow V_{\theta_1}$.

• $S \in \{100, 110\} \Rightarrow V_{\theta_1}$; $S \in \{001, 011\} \Rightarrow V_{\theta_2}$;
  $S \in \{010, 101\} \Rightarrow$ draw.

• $S \in \{0101, 1100\} \Rightarrow V_{\theta_1}$; $S \in \{1010, 0011\} \Rightarrow V_{\theta_2}$.

Had we used exact MAP, the decisions would have been exactly the
same as the algorithm's. Thus, this example shows that the
algorithm can achieve the same performance as MAP. We do not yet,
however, have sufficient conditions for this. One can also verify
that the regular histogram method would be no better than tossing
a coin in the preceding example.

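Example 3.4 (All combinations): Let
$\Sigma = \Sigma_1 \cup \Sigma_2$ and
$\Sigma_1 \cap \Sigma_2 \neq \emptyset$. Let
$V_{\theta_1} = \Sigma_1^{L_1}$ and
$V_{\theta_2} = \Sigma_2^{L_2}$. Assume that the vocabularies are
equiprobable and let $S \in (\Sigma_1 \cap \Sigma_2)^m$ be the
observed sequence. Then, the decision criterion is to choose the
vocabulary with maximum

$$\frac{1}{|\Sigma_k|} \binom{L_k}{m} p^{L_k - m} (1-p)^m, \quad k = 1, 2. \tag{7}$$

In other words, the algorithm decides in favor of a vocabulary
based on

$$p^{L_1 - L_2}\, \binom{L_1}{m} \Big/ \binom{L_2}{m} \;\underset{V_{\theta_2}}{\overset{V_{\theta_1}}{\gtrless}}\; \frac{|\Sigma_1|}{|\Sigma_2|}. \tag{8}$$

Had we used exact MAP, the recognition results would have been
based on

$$p^{L_1 - L_2}\, \binom{L_1}{m} \Big/ \binom{L_2}{m} \;\underset{V_{\theta_2}}{\overset{V_{\theta_1}}{\gtrless}}\; \left( \frac{|\Sigma_1|}{|\Sigma_2|} \right)^{m}. \tag{9}$$

These two criteria differ only in their right-hand sides.
Depending on the values of the parameters, the two methods may or
may not reach the same conclusion. In the special case where
$|\Sigma_1| > |\Sigma_2|$ and $L_1 = L_2$, both methods decide in
favor of $V_{\theta_2}$, which has the smaller alphabet. In this
case, the regular histogram method gives the same result, as the
vocabularies have no specific pattern.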
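
The two criteria in this example can be traced back to the
definitions with a short calculation. The derivation below is a
sketch under the assumptions that $V_{\theta_k} = \Sigma_k^{L_k}$
contains every word of length $L_k$, that every letter of $S$
lies in $\Sigma_k$, and that $m \le L_k$; the common factor
$p^{L_k-m}(1-p)^m$ and the prior are omitted.

```latex
% Every letter of \Sigma_k appears at position j in |\Sigma_k|^{L_k-1} words.
\begin{align*}
\Phi_{L_k}(\sigma, j) &= |\Sigma_k|^{L_k-1}, \\
\Psi_{L_k,m}(\sigma, i)
  &= |\Sigma_k|^{L_k-1} \sum_{j=1}^{L_k} \binom{j-1}{i-1}\binom{L_k-j}{m-i}
   = |\Sigma_k|^{L_k-1} \binom{L_k}{m}, \\
% Histogram score, normalized by |V_{\theta_k}| = |\Sigma_k|^{L_k}:
\frac{1}{|V_{\theta_k}|} \cdot \frac{1}{m} \sum_{i=1}^{m} \Psi_{L_k,m}(s_i, i)
  &= \frac{1}{|\Sigma_k|} \binom{L_k}{m}, \\
% Exact count: choose m positions for S, the other L_k - m letters are free.
\frac{1}{|V_{\theta_k}|} \sum_{W \in V_{\theta_k}} \psi(S, W)
  &= \frac{|\Sigma_k|^{L_k-m}}{|\Sigma_k|^{L_k}} \binom{L_k}{m}
   = \frac{1}{|\Sigma_k|^{m}} \binom{L_k}{m}.
\end{align*}
```

The approximation therefore penalizes a large alphabet only
linearly, whereas exact MAP penalizes it with exponent $m$, which
is why the two rules can disagree when $L_1 \neq L_2$.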

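Example 3.5 (i.i.d. sources): Consider two i.i.d. sources
over the alphabet $\Sigma = \{a, c, g, t\}$. Let

$$P_{\theta_1} = \left( \tfrac{1}{4} - \tfrac{i}{50},\; \tfrac{1}{4} - \tfrac{i}{100},\; \tfrac{1}{4} + \tfrac{i}{100},\; \tfrac{1}{4} + \tfrac{i}{50} \right) \tag{10}$$

$$P_{\theta_2} = \left( \tfrac{1}{4} + \tfrac{i}{50},\; \tfrac{1}{4} + \tfrac{i}{100},\; \tfrac{1}{4} - \tfrac{i}{100},\; \tfrac{1}{4} - \tfrac{i}{50} \right) \tag{11}$$

denote the density functions of the two i.i.d. sources on
$\Sigma$, in which $i \in \{0, 1, 2, 3, 4\}$ is a deviation
parameter.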
For each vocabulary and for each $i$, we use the given densities
to generate 128,000 random words of length 20 to 40. A Monte
Carlo error analysis with 2000 trials was then conducted.
Table III summarizes the resulting error rate of the algorithm
for four different values of the deletion probability $p$. As $i$
increases, the KL distance between the two distributions
increases and the error rate decreases. In this example, the
regular histogram method has comparable error performance, as the
words are generated completely i.i.d.

TABLE III
Error results for i.i.d. sources. Each vocabulary contains
128,000 words, generated as described in Example 3.5.

          p = 0.1   p = 0.2   p = 0.3   p = 0.4
i = 0     0.50      0.50      0.50      0.49
i = 1     0.37      0.38      0.39      0.40
i = 2     0.25      0.27      0.27      0.29
i = 3     0.16      0.18      0.19      0.21
i = 4     0.09      0.10      0.12      0.14

Example 3.6 (Capturing inherent structures): Using the same
alphabet as in the previous example, two vocabularies are
generated such that the second one is the exact horizontal mirror
of the first one. Each vocabulary contains 128,000 words. For the
first vocabulary, the words were created in five different cases.
In the first case, for each word, a random length between 20 and
40 is chosen and each letter of the word is chosen uniformly at
random from $\Sigma$. In the second to fifth cases, we simply add
a prefix 'a', 'ac', 'acg', 'acgt', and 'acgta', to all words,
respectively. In each case, we then truncate from the end of the
words as many letters as the length of the prefix added. Table IV
summarizes the results for the different cases and for four
different deletion probabilities. Since the vocabularies are
exact mirrors of each other, the regular histogram method
resulted in a 50% error rate in all cases. Note the large
reduction in error rate across all deletion probabilities
obtained by adding just a one-letter prefix. For example, for a
deletion probability of $p = 0.4$, the error rate is roughly
halved. This shows that the algorithm is successful in capturing
existing patterns in the vocabularies and is sensitive to word
permutation.

IV. Analysis of the algorithm 

The maximum a posteriori (MAP) method is equivalent to
maximizing Eq. (1), in which the main challenge is to compute
$P(S \mid W)$. Here, we discuss the derivation of an approximate
algorithm for it.

Assume a sequence $S$ of size $m = |S|$ is observed and let $W$
be a word of length $n = |W|$. Sequence $S$ is a subsequence of
$W$ if there exists a warping function (a monotonically
increasing sequence of indices)

$$\phi_m = (i_1, \ldots, i_m), \quad 1 \le i_1 < i_2 < \cdots < i_m \le n,$$

such that $S = \phi_m(W) = (w_{i_1}, \ldots, w_{i_m})$. The
function $\phi_m$ may be viewed as an ordered $m$-subset of
$[n]$. Thus, $S$ is a subsequence of $W$ if the number of such
maps, i.e.,

$$\psi(S, W) = |\{\phi_m \subset [n] : S = \phi_m(W)\}| \tag{12}$$

is non-zero. Using this expression, we can describe the
probability of observing $S$ conditioned on $W$ as

$$P(S \mid W) = \psi(S, W)\, p^{n-m} (1-p)^m. \tag{13}$$

Consequently, plugging (13) into Eq. (1), we obtain a
discriminant function

$$\frac{P(\theta)}{|V_\theta|} \sum_{n=m}^{L} p^{n-m} (1-p)^m \sum_{W \in V_\theta(n)} \psi(S, W). \tag{14}$$

Exact computation of $\psi(S, W)$ requires dynamic programming
that has $O(mn)$ time complexity. Thus, the computation of (14)
is $O(mL|\Sigma|^L)$ when $|V_\theta| = O(|\Sigma|^L)$.
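
The dynamic program mentioned above can be sketched as follows.
This is the standard table-filling count of the number of ways a
sequence embeds in a word, written with a single rolling array;
the function names are illustrative and the paper does not
prescribe this particular formulation.

```python
def count_embeddings(s, w):
    """Number of increasing index sequences of w that spell s, in O(|s||w|) time."""
    m = len(s)
    # ways[i] = number of ways to embed the first i letters of s
    # into the portion of w scanned so far.
    ways = [1] + [0] * m
    for ch in w:
        # Go right to left so each letter of w is used at most once per embedding.
        for i in range(m, 0, -1):
            if s[i - 1] == ch:
                ways[i] += ways[i - 1]
    return ways[m]


def aggregated_count(s, vocabulary):
    """Sum of the embedding counts over every word of a vocabulary."""
    return sum(count_embeddings(s, w) for w in vocabulary)
```

The per-word cost is small; what makes the exact discriminant
expensive is that `aggregated_count` must touch every word, and a
generic vocabulary can contain exponentially many of them.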

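A. Approximation

We have

$$\sum_{W \in V_\theta(n)} \psi(S, W) = \sum_{W \in V_\theta(n)} \sum_{\phi_m} I(\phi_m(W) = S), \tag{15}$$

in which $I(\cdot)$ is the equality indicator function.
Approximating $I(\cdot)$ with the following upper bound

$$I(\phi_m(W) = S) \le \frac{1}{m} \sum_{i=1}^{m} I(\phi_{m,i}(W) = s_i),$$

where $\phi_{m,i}(W)$ denotes the $i$-th element of
$\phi_m(W)$, and substituting in Eq. (15), we obtain

$$\sum_{W \in V_\theta(n)} \sum_{\phi_m} I(\phi_m(W) = S) \le \frac{1}{m} \sum_{i=1}^{m} \sum_{\phi_m} \sum_{W \in V_\theta(n)} I(\phi_{m,i}(W) = s_i). \tag{16}$$

The rightmost summation returns the number of all $n$-length
words whose $i$-th position under a warping $\phi_m$ matches
$s_i$. A number

$$\binom{j-1}{i-1} \binom{n-j}{m-i} \tag{17}$$

of these warping functions map the $j$-th position of a word $W$
to the $i$-th position of the observed sequence $S$. This is the
number of ordered placements of $n$ distinguishable objects into
$m$ bins that place object $j$ into bin $i$, satisfying the
identity

$$\sum_{j=1}^{n} \binom{j-1}{i-1} \binom{n-j}{m-i} = \binom{n}{m} \tag{18}$$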

for every $i = 1, \ldots, m$. For any symbol
$\sigma \in \Sigma$, Eq. (2), i.e.,

$$\Phi_n(\sigma, j) = \sum_{W \in V_\theta(n)} I(w_j = \sigma),$$

denotes the number of words whose $j$-th position (from left) is
$\sigma$. Eq. (3), defined as

$$\Psi_{n,m}(\sigma, i) = \sum_{j=1}^{n} \Phi_n(\sigma, j) \binom{j-1}{i-1} \binom{n-j}{m-i},$$

denotes the number of $m$-length subsequences, in the
subsequence multiset derived from $V_\theta(n)$, whose $i$-th
element is $\sigma$.

Using this notation, we can simplify the right-hand side of (16)
as

$$\frac{1}{m} \sum_{i=1}^{m} \sum_{\phi_m} \sum_{W \in V_\theta(n)} I(\phi_{m,i}(W) = s_i) = \frac{1}{m} \sum_{i=1}^{m} \Psi_{n,m}(s_i, i),$$

which is the similarity score of $S$ and $V_\theta(n)$, measured
as the average number of $m$-length subsequences that are derived
from $V_\theta(n)$ and that match $S$ in different positions.
This is Eq. (5) in the recognition phase, which results in the
total similarity score (6).

TABLE IV
Example 3.6, demonstrating that the algorithm is capable of
capturing inherent structures in vocabularies and is sensitive
to word permutation.

          p = 0.1   p = 0.2   p = 0.3   p = 0.4
Case 1    0.49      0.50      0.49      0.49
Case 2    0.16      0.19      0.23      0.27
Case 3    0.08      0.13      0.19      0.23
Case 4    0.07      0.13      0.19      0.21
Case 5    0.09      0.14      0.19      0.23
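
The steps of this derivation can be checked numerically on small
inputs. The sketch below, with hypothetical function names,
enumerates every warping explicitly, so it is exponential and
only suitable for toy vocabularies. It compares the exact count,
the averaged position-wise match count, and the closed form built
from binomial coefficients.

```python
from itertools import combinations
from math import comb


def exact_and_bound(s, words):
    """Enumerate all warpings; return (exact matches, averaged position-wise matches)."""
    m = len(s)
    exact = 0
    matches = 0
    for w in words:
        for idx in combinations(range(len(w)), m):
            hits = sum(1 for i, j in enumerate(idx) if w[j] == s[i])
            exact += hits == m      # the warping reproduces S entirely
            matches += hits         # positions where the warping agrees with S
    return exact, matches / m


def histogram_score(s, words, n):
    """Closed form of the averaged count via positional histograms."""
    m = len(s)
    total = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            phi = sum(1 for w in words if w[j - 1] == s[i - 1])
            total += phi * comb(j - 1, i - 1) * comb(n - j, m - i)
    return total / m


def warpings_through(n, m, j, i):
    """Number of warpings of length m in [n] that send position j to slot i."""
    return comb(j - 1, i - 1) * comb(n - j, m - i)
```

For the vocabulary $\{0101, 1100\}$ and $S = 01$, the exact count
is 3 while both the enumeration and the closed form give 5, which
illustrates that the score over-counts rather than under-counts.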

B. Error analysis 

The probability of error for this algorithm does not have a
closed-form solution. In the special case in which the
vocabularies are equiprobable and have the same number of words,
$N$, all of the same length, $n$, useful insights can be
obtained. In such a case, with $S$ of length $m$ observed, the
probability of error for an exact MAP solution is

$$P_{\mathrm{MAP}}(\mathrm{error}) = \sum_{m=0}^{n} \frac{\mu(m)}{2N}\, p^{n-m} (1-p)^m, \tag{19}$$

where

$$\mu(m) = \sum_{|S| = m} \min_{\theta} \sum_{W \in V_\theta} \psi(S, W) \tag{20}$$

is the cardinality of the intersection of the multisets of
subsequences of length $m = |S|$ derived from the vocabularies.
As $\mu(m)$ increases, $P_{\mathrm{MAP}}(\mathrm{error})$
increases, reaching a maximum of $\frac{1}{2}$ when the two
vocabularies become identical.
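
For small vocabularies the MAP error can be evaluated by direct
enumeration. The sketch below assumes the error has the form
$\sum_{m=0}^{n} \frac{\mu(m)}{2N} p^{n-m}(1-p)^m$, with $\mu(m)$
the sum over all length-$m$ sequences of the smaller of the two
aggregated embedding counts; that form, the function names, and
the equal-length, equal-size requirement are assumptions of this
illustration. The enumeration is exponential in $m$.

```python
from itertools import product


def count_embeddings(s, w):
    """Number of ways s occurs as a subsequence of w (dynamic programming)."""
    ways = [1] + [0] * len(s)
    for ch in w:
        for i in range(len(s), 0, -1):
            if s[i - 1] == ch:
                ways[i] += ways[i - 1]
    return ways[-1]


def mu(m, v1, v2, alphabet):
    """Size of the multiset intersection of the length-m subsequences."""
    total = 0
    for s in product(alphabet, repeat=m):
        c1 = sum(count_embeddings(s, w) for w in v1)
        c2 = sum(count_embeddings(s, w) for w in v2)
        total += min(c1, c2)
    return total


def map_error(v1, v2, alphabet, p):
    """MAP error for equiprobable, equal-size vocabularies of equal-length words."""
    n, size = len(v1[0]), len(v1)
    return sum(
        mu(m, v1, v2, alphabet) / (2 * size) * p ** (n - m) * (1 - p) ** m
        for m in range(n + 1)
    )
```

Passing the same vocabulary twice returns one half for any $p$,
which matches the remark that identical vocabularies are the
worst case.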

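The expression for the error probability of the
subsequence-histogram algorithm replaces $\mu(m)$ with

$$\lambda(m) = \sum_{|S| = m} \sum_{W \in V_{\theta_1}} \psi(S, W)\, I\big(J_{\theta_1}(S) < J_{\theta_2}(S)\big)
+ \sum_{|S| = m} \sum_{W \in V_{\theta_2}} \psi(S, W)\, I\big(J_{\theta_2}(S) < J_{\theta_1}(S)\big). \tag{21}$$

Let $\gamma_\theta(S)$ denote the similarity score (5) of $S$
with respect to $V_\theta$, which upper bounds
$\sum_{W \in V_\theta} \psi(S, W)$ and, in this special case, is
proportional to $J_\theta(S)$. Using the inequality
$I(x < 1) \le x^{-1/2}$, we will have

$$\lambda(m) \le 2 \sum_{|S| = m} \sqrt{\gamma_{\theta_1}(S)\, \gamma_{\theta_2}(S)}. \tag{22}$$

This implies the following upper bound on the probability of
error of the algorithm:

$$P(\mathrm{error}) \le \sum_{m=0}^{n} \frac{1}{N}\, p^{n-m} (1-p)^m \sum_{|S| = m} \sqrt{\gamma_{\theta_1}(S)\, \gamma_{\theta_2}(S)}. \tag{23}$$

Thus, we conclude that subsequences with equivalently large
$\gamma_{\theta_1}(S)$ and $\gamma_{\theta_2}(S)$ have a bigger
impact on the error.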

V. Conclusion 

The aim of this paper is to demonstrate an approximation
algorithm for the problem of generic vocabulary recognition over
deletion channels. Without any prior assumption on the structure
of the vocabularies, the subsequence-histogram algorithm seeks to
extract regularity patterns of a vocabulary through an off-line
analysis. The algorithm uses this data to choose the more likely
underlying vocabulary for a received subsequence in polynomial
time and space.

The regular histogram method may be viewed as a special case of
this algorithm. Unlike a regular histogram, however, this
algorithm can exploit the structure of a vocabulary to
substantially improve its performance. In some situations, the
algorithm can achieve the same performance as exact MAP. However,
sufficient conditions that characterize such situations are not
yet known.

Some immediate future directions are: 1) analyzing the
performance of the algorithm and its proximity to MAP, 2)
exploring applications in bioinformatics, sequence segmentation,
storage systems, and search engines, 3) extending the work to
multiple observations, represented by subsequences of different
words within one vocabulary, and 4) generalizing the model to
include substitution and insertion errors.